[wip] Add trace-aware stage summaries + plotting helper for profiling #1168
Description
At the Ray stage level we now log every major hop a document takes: queue time before a worker picks it up, the full `pdf_extractor` resident time, and downstream stages (YOLOX ensembles for tables/charts, OCR/text extraction, metadata construction, embedding, storage, etc.). These show up in `results.wall_time.png` and the stage-time bar chart.
Inside the PDF extractor we added sub-spans for the previously opaque rasterization leg: rendering the page via PDFium, copying the bitmap into NumPy, scaling to YOLOX input size, and padding. Those spans feed the new PDFium breakdown chart and CSV so we can compare per-document/per-page cost. You can now say "document `2062555.pdf` spent ~0.65 s/page in scaling, which is 60% of its PDF extractor time" instead of just "this doc was slow."
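For reviewers unfamiliar with the pattern, the sub-span instrumentation can be pictured as a small context-manager timer wrapped around each rasterization leg. This is a minimal sketch with assumed names (`span`, `spans`, and the leg labels are illustrative, not the PR's actual API):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulate wall-clock seconds per named sub-span. In the real
# extractor these totals would feed the PDFium breakdown chart/CSV.
spans = defaultdict(float)

@contextmanager
def span(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        spans[name] += time.perf_counter() - start

# Usage: wrap each leg of page rasterization. The sleeps stand in for
# real work (PDFium render, bitmap copy, resize to YOLOX input size).
with span("render"):
    time.sleep(0.01)
with span("scale"):
    time.sleep(0.005)

print({name: round(secs, 3) for name, secs in spans.items()})
```

Because the timer accumulates rather than overwrites, per-page costs sum naturally into per-document totals across a loop over pages.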
The combination of stage-level metrics (queue/wall/resident) plus the PDFium micro-spans gives a holistic view: you can see how much time each document spends waiting in Ray, how much is consumed by the PDF extractor as a whole, and exactly which sub-step dominates inside that extractor.
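To make the aggregation concrete, here is a hedged sketch of the textual half of a `plot_stage_totals`-style helper: it sorts stages by cumulative resident seconds and prints a crude bar per stage. The `results.json` shape below is an assumption for illustration only; the real `trace_summary` layout may differ.

```python
# Hypothetical trace_summary shape; stage names and values are made up.
results = {
    "trace_summary": {
        "stages": {
            "pdf_extractor": {"resident_s": 12.4},
            "yolox_table":   {"resident_s": 4.1},
            "ocr":           {"resident_s": 2.7},
        }
    }
}

def stage_totals(results):
    """Return (stage, resident seconds) pairs, largest first."""
    stages = results["trace_summary"]["stages"]
    return sorted(
        ((name, s["resident_s"]) for name, s in stages.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )

for name, secs in stage_totals(results):
    bar = "#" * int(secs)  # one '#' per resident second
    print(f"{name:14s} {secs:6.1f}s {bar}")
```

The PNG output would plot the same sorted totals; collapsing nested entries or filtering network noise amounts to pre-processing the `stages` dict before this sort.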
Task List
- Plumb `enable_traces`/`trace_output_dir` through the test config and e2e case so trace payloads are captured automatically during scripted runs.
- Add `trace_summary` generation in `scripts/tests/cases/e2e.py`, writing per-stage aggregates plus per-document totals; `run.py` now records trace flags in `results.json`.
- Add `scripts/tests/tools/plot_stage_totals.py`, a helper that reads any `results.json` and emits a PNG + textual summary showing cumulative resident seconds per stage (with options to sort, collapse nested entries, filter network noise, etc.).

Testing:
Ran the e2e case with `ENABLE_TRACES=true`; verified `results.json` contains the new `trace_summary`, trace files land under `artifacts/.../traces/`, and the plotting tool produces the expected charts (`*.stage_time.png`) using both collapsed and nested views.

Checklist